A New Open Information Extraction System Using Sentence Difficulty Estimation
The World Wide Web contains a considerable amount of information expressed in natural language. While unstructured text is often difficult for machines to understand, Open Information Extraction (OIE) is a relation-independent extraction paradigm designed to extract assertions directly from massive and heterogeneous corpora. Keeping computational costs low is a central demand for Open Relation Extraction (ORE) systems. A large number of ORE methods have been proposed recently, covering a wide range of NLP tools, from "shallow" (e.g., part-of-speech tagging) to "deep" (e.g., semantic role labeling). There is a trade-off between the depth of NLP tools and the efficiency (computational cost) of ORE systems. This paper describes a novel approach called Sentence Difficulty Estimator for Open Information Extraction (SDE-OIE), which automatically estimates relation extraction difficulty by developing difficulty classifiers. These classifiers assign each input sentence to an appropriate OIE extractor in order to decrease the overall computational cost. Our evaluations show that intelligently selecting the proper depth of ORE system significantly improves the effectiveness and scalability of SDE-OIE. It avoids wasting resources and achieves almost the same performance as its constituent deep extractor in a more reasonable time.
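The routing idea in the abstract can be illustrated with a minimal sketch. The difficulty score, threshold, and extractor stubs below are all hypothetical stand-ins: SDE-OIE uses learned difficulty classifiers, not the hand-written length heuristic shown here.

```python
# Sketch of difficulty-based routing between a cheap ("shallow") and an
# expensive ("deep") OIE extractor. All names and heuristics here are
# illustrative assumptions, not the paper's actual components.

def difficulty_score(sentence: str) -> int:
    # Hypothetical proxy: longer sentences with more clause markers
    # are treated as harder for relation extraction.
    tokens = sentence.split()
    clause_markers = sum(sentence.count(m) for m in (",", ";", " which ", " that "))
    return len(tokens) + 5 * clause_markers

def shallow_extractor(sentence: str) -> list:
    # Placeholder for a cheap extractor built on shallow NLP tools
    # (e.g., part-of-speech tagging).
    return [("shallow", sentence)]

def deep_extractor(sentence: str) -> list:
    # Placeholder for an expensive extractor built on deep NLP tools
    # (e.g., semantic role labeling).
    return [("deep", sentence)]

def route(sentence: str, threshold: int = 25) -> list:
    # Easy sentences go to the cheap extractor; hard ones to the deep
    # one, so deep processing is spent only where it is needed.
    if difficulty_score(sentence) < threshold:
        return shallow_extractor(sentence)
    return deep_extractor(sentence)
```

The overall cost saving comes from the fact that most web sentences are routed to the shallow extractor, while only the hard minority pays the price of deep processing.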
Mismatching-Aware Unsupervised Translation Quality Estimation For Low-Resource Languages
Translation Quality Estimation (QE) is the task of predicting the quality of machine translation (MT) output without any reference. This task has gained increasing attention as an important component in practical applications of MT. In this paper, we first propose XLMRScore, a cross-lingual counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model. This metric can be used as a simple unsupervised QE method, but employing it raises two issues: first, untranslated tokens lead to unexpectedly high translation scores, and second, mismatching errors arise between source and hypothesis tokens when applying greedy matching in XLMRScore. To mitigate these issues, we suggest, respectively, replacing untranslated words with the unknown token and cross-lingually aligning the pre-trained model so that aligned words are represented closer to each other. We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task, as well as a new English-Farsi test dataset introduced in this paper. Experiments show that our method achieves results comparable with the supervised baseline in two zero-shot scenarios, i.e., with less than a 0.01 difference in Pearson correlation, while outperforming unsupervised rivals by more than 8% on average across all the low-resource language pairs.

Comment: Submitted to Language Resources and Evaluation
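The two core operations in the abstract, masking untranslated tokens and greedy matching, can be sketched as follows. The toy 2-d vectors stand in for XLM-RoBERTa embeddings, and the precision-style averaging mirrors the BERTScore family of metrics; everything here is a simplified assumption, not the paper's exact scoring pipeline.

```python
import numpy as np

UNK = "<unk>"  # stand-in for the model's unknown token

def mask_untranslated(src_tokens, hyp_tokens):
    # Hypothesis tokens copied verbatim from the source are treated as
    # untranslated and replaced by the unknown token, so they cannot
    # inflate the similarity score.
    src_set = set(src_tokens)
    return [UNK if t in src_set else t for t in hyp_tokens]

def greedy_match_score(src_vecs, hyp_vecs):
    # Greedy matching as in BERTScore-style metrics: each hypothesis
    # vector is matched to its most similar source vector, and the
    # maximum cosine similarities are averaged.
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [max(cos(h, s) for s in src_vecs) for h in hyp_vecs]
    return sum(sims) / len(sims)
```

In the paper's setting, the source and hypothesis vectors come from a cross-lingually aligned pre-trained model, which is what makes the greedy source-to-hypothesis matches meaningful across languages.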